The objective of our project was to determine which variables contribute to Kobe Bryant’s success and accuracy as a player on the Los Angeles Lakers. We figured the best way to go about this was to create sub-questions targeting each variable. Our plan was to test correlations and decide what to use in our prediction model. Variables we were interested in included the season (year), his age, his injury status, his 3-point percentage, 2-point percentage, free-throw percentage, shot types, and the total points Kobe contributed to the game. The variables in our Kaggle dataset included: shot made flag, combined shot type, period, shot distance, playoffs, season, shot type, shot zone area, shot zone basic, shot zone range, t_sec (time remaining in the game), home or away, and opponent.
Our sub-questions were:
- Can we predict how many shots Kobe will make in a given game using his accuracy?
- What types of shots (layups, jump shots, etc.) does Kobe make best (highest accuracy)?
- What are his strengths and weaknesses as a basketball player?
- What parts of the court is he most accurate from? (hotspots on the court)
- How clutch was Kobe?
- What season(s) were Kobe’s prime?
- How did Kobe improve or decline over time?
- What was Kobe’s influence on a game’s win or loss?
We chose to explore the dataset titled “Kobe Bryant Shot Selection”, which we obtained from Kaggle. It contains 25 variables related to the location on the court, the action, the opponent, the date, the time, the type of game, and much more. We found other Kobe datasets but chose this one because it seemed the cleanest and had the most descriptive, clearly named variables. As sports fans, we thought it would be interesting to work with, and interesting for others looking at our work, especially since “Kobe” is such a familiar name. We hoped that highlighting Kobe’s amazing career through his data would help us all better appreciate his impact on both basketball and the world.
We wanted to visualize the different variables in this dataset that are relevant to the position of the court by plotting them on the court itself.
This visualization shows the position of each shot and whether that shot was made (red) or missed (blue). It shows that shots closest to the net are made more often than shots far from the net.
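A shot chart like this can be sketched with ggplot2. The column names `loc_x`, `loc_y`, and `shot_made_flag` follow the Kaggle dataset; the data frame below is a synthetic stand-in so the sketch runs on its own, and with the real data you would substitute the cleaned shot table.

```r
library(ggplot2)

# Toy stand-in for the Kaggle columns (loc_x, loc_y, shot_made_flag) so the
# sketch runs without the full dataset
set.seed(1)
kobe_demo <- data.frame(
  loc_x          = runif(500, -250, 250),
  loc_y          = runif(500, -50, 400),
  shot_made_flag = rbinom(500, 1, 0.45)
)

p <- ggplot(kobe_demo, aes(loc_x, loc_y, color = factor(shot_made_flag))) +
  geom_point(alpha = 0.4, size = 0.8) +
  scale_color_manual(values = c("0" = "blue", "1" = "red"),
                     labels = c("Missed", "Made"), name = NULL) +
  coord_fixed() +                      # keep court proportions
  labs(title = "Shot chart", x = NULL, y = NULL)
p
```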
This visualization shows how the variable “shot_zone_basic” maps onto the court: we can see how each zone is defined by the shots taken from it.
This visualization shows the variable “shot_zone_area”, which is similar to “shot_zone_basic”, except that it also takes into account the side of the court the shot is taken from.
This visualization shows the variable “shot_zone_range”, the distance range from the net at which each shot was taken.
This visualization shows the variable “combined_shot_type”, and shows where on the court each type of shot is taken.
This bar graph shows the different types of shots and their accuracy. From this we can see that shots taken closest to the net have the highest accuracy.
This bar graph shows the distribution of shot types in each season. From this we can see that Kobe took the most shots in the 2005-2006 season, and that in every season most of his shots were jump shots.
This violin plot shows the density of each shot type as the time remaining in the period decreases. We can see that as a period ends, tip shots become the most common.
This plot shows Kobe’s accuracy on 2-point and 3-point shots against each opponent.
We chose to implement a modeling method compatible with the tidyverse (Wickham et al., 2019) that was introduced in the broom package vignettes (Robinson and Hayes, 2019). The method replaces the classical for-loop-dependent way of performing cross-validation (James et al., 2013) with a vectorized format, which is thought to be more computationally and memory efficient (Wickham, 2014). We believed that implementing this method would be advantageous for handling a large dataset, ~20K observations in our case.
df <- kobe %>%
  select(shot_made_flag, shot_distance, t_sec, home, opponent, season) %>%
  modelr::crossv_kfold(k = 10) %>%  # produce training and test columns
  mutate(glm = map(train, ~glm(shot_made_flag ~ ., data = .x, family = binomial)),  # fit a glm on each training fold
         pred = map2(glm, test, ~predict.glm(object = .x, newdata = .y, type = "response")),  # predict on the test fold
         pred_class = map(pred, ~if_else(.x > 0.5, 1, 0)),  # apply a cutoff to obtain predicted classes
         true_class = map(test, ~as_tibble(.x)$shot_made_flag),  # extract the original labels
         misclass_error = map2_dbl(pred_class, true_class, ~mean(.x != .y)))  # misclassification error per fold
Briefly, modelr::crossv_kfold (Wickham, 2019) is used to split the data into training and test sets; it stores row indices, rather than repeated entries, in the “train” and “test” columns. Each modeling step is then stored in its own column created through dplyr::mutate. To “mutate” from one column to the next, a series of purrr::map_* functions unpacks the variables from each column, applies an anonymous function, and outputs the respective format.
# A tibble: 10 x 8
train test .id glm pred pred_class true_class misclass_error
<named list> <named list> <chr> <named list> <named list> <named list> <named list> <dbl>
1 <resample> <resample> 01 <glm> <dbl [2,570]> <dbl [2,570]> <dbl [2,570]> 0.396
2 <resample> <resample> 02 <glm> <dbl [2,570]> <dbl [2,570]> <dbl [2,570]> 0.405
3 <resample> <resample> 03 <glm> <dbl [2,570]> <dbl [2,570]> <dbl [2,570]> 0.403
4 <resample> <resample> 04 <glm> <dbl [2,570]> <dbl [2,570]> <dbl [2,570]> 0.4
5 <resample> <resample> 05 <glm> <dbl [2,570]> <dbl [2,570]> <dbl [2,570]> 0.400
6 <resample> <resample> 06 <glm> <dbl [2,570]> <dbl [2,570]> <dbl [2,570]> 0.411
7 <resample> <resample> 07 <glm> <dbl [2,570]> <dbl [2,570]> <dbl [2,570]> 0.405
8 <resample> <resample> 08 <glm> <dbl [2,569]> <dbl [2,569]> <dbl [2,569]> 0.390
9 <resample> <resample> 09 <glm> <dbl [2,569]> <dbl [2,569]> <dbl [2,569]> 0.417
10 <resample> <resample> 10 <glm> <dbl [2,569]> <dbl [2,569]> <dbl [2,569]> 0.396
In exploring our data, we decided to use hierarchical clustering, as we wanted to apply methods learned in class. We decided that the ward.D2 linkage method was best to cluster Kobe’s 20-season career based on shot made flag, combined shot type, period, shot distance, playoffs, season, shot type, shot zone area, shot zone basic, shot zone range, t_sec (time remaining), home or away, and opponent. We notice that ward.D2 clusters Kobe’s regular seasons and playoffs fairly evenly: the purple cluster contains only regular seasons, while the yellow cluster includes a mix of playoffs and regular seasons.
As an alternative to the Euclidean distance used with ward.D2, we also tried clustering based on correlation. Correlation-based clustering does not seem as effective: the clusters are less even than with ward.D2, and both contain a mix of regular seasons and playoffs.
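The two clustering approaches can be sketched side by side. The feature matrix below is synthetic; the real analysis summarized each of Kobe’s seasons over the shot variables listed above.

```r
# Toy per-"season" feature matrix (20 seasons x 5 synthetic features)
set.seed(2)
X <- matrix(rnorm(20 * 5), nrow = 20)
rownames(X) <- paste0("season_", 1:20)

# Euclidean distance with Ward's linkage (ward.D2)
hc_ward <- hclust(dist(X), method = "ward.D2")

# Correlation-based dissimilarity (1 - r) with the same linkage
hc_cor <- hclust(as.dist(1 - cor(t(X))), method = "ward.D2")

cutree(hc_ward, k = 2)  # two-cluster assignment per "season"
```

Note that `hclust` distinguishes `"ward.D"` from `"ward.D2"`; the latter squares the dissimilarities as in Ward’s original criterion.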
For the GLM model, we wanted to find the cutoff that minimizes the misclassification error. We swept over candidate cutoffs and found that a cutoff of 0.523 gives the minimum misclassification error of 0.391. The cutoff minimizing misclassification error gives the best estimate of Kobe’s accuracy.
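The cutoff sweep can be sketched as follows. The data here is synthetic so the block runs on its own; on the real data this sweep produced the 0.523 cutoff.

```r
# Fit a logistic model on toy data, then scan cutoffs for the one
# minimizing misclassification error
set.seed(3)
n <- 1000
shot_distance  <- runif(n, 0, 30)
shot_made_flag <- rbinom(n, 1, plogis(1 - 0.1 * shot_distance))

fit  <- glm(shot_made_flag ~ shot_distance, family = binomial)
pred <- predict(fit, type = "response")

cutoffs <- seq(0.1, 0.9, by = 0.001)
errors  <- sapply(cutoffs, function(ct) mean((pred > ct) != shot_made_flag))
cutoffs[which.min(errors)]  # cutoff with the lowest error on this toy data
```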
To select variables for predicting accuracy, we used forward selection, similar to the homework. First we used period as a predictor of accuracy, then added the remaining variables one at a time. We found that period, shot distance, and playoffs were the best variables for predicting Kobe’s accuracy: the misclassification error was lowest with these three as predictors.
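One step of this forward selection can be sketched as below: among the candidate predictors, pick the one whose single-variable model has the lowest misclassification error, then repeat with it fixed. The data is synthetic, and training error stands in for the 10-fold CV error the project actually used.

```r
# Toy data with the three candidate predictors
set.seed(4)
n  <- 500
df <- data.frame(
  period        = factor(sample(1:4, n, replace = TRUE)),
  shot_distance = runif(n, 0, 30),
  playoffs      = rbinom(n, 1, 0.2)
)
df$shot_made_flag <- rbinom(n, 1, plogis(1 - 0.08 * df$shot_distance))

# Misclassification error of a glm on a given set of predictors
misclass <- function(vars) {
  fit <- glm(reformulate(vars, "shot_made_flag"), data = df, family = binomial)
  mean((predict(fit, type = "response") > 0.5) != df$shot_made_flag)
}

errs <- sapply(c("period", "shot_distance", "playoffs"), misclass)
names(which.min(errs))  # the first variable to enter the model
```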
How clutch was Kobe? “Clutch” means performing well in a crucial situation, for example, Kobe hitting a shot to win the game as time runs out. To measure this, we compared accuracy by time remaining in the game. We filtered for the last two minutes of the 4th period, or overtime when applicable, because that is when Kobe would need to be clutch. We created two linear models, one for the regular season and one for the playoffs, using the quarter (fourth quarter or overtime) and game time left to predict accuracy. It is hard to conclude whether Kobe was clutch, because he took so many shots and “crucial situation” is intangible. We did find that quarter and game time left are significant predictors of accuracy in the regular season, but not in the playoffs. This may be because there are far fewer playoff games than regular-season games.
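The filtering step behind these models can be sketched with toy data; `period` and `t_sec` mirror our dataset’s columns (with period 5 marking overtime here as an assumption), and `t_sec` is seconds remaining in the period.

```r
# Toy shot log
set.seed(5)
shots <- data.frame(
  period         = sample(1:5, 300, replace = TRUE),
  t_sec          = sample(0:720, 300, replace = TRUE),
  shot_made_flag = rbinom(300, 1, 0.45)
)

# Keep shots from the 4th period or overtime with at most two minutes left
clutch <- subset(shots, period >= 4 & t_sec <= 120)

# Compare accuracy in clutch situations with overall accuracy
c(clutch = mean(clutch$shot_made_flag), overall = mean(shots$shot_made_flag))
```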
For playoff seasons:
##
## Call:
## lm(formula = accuracy ~ Quarter + gametime_by_seconds, data = Clutchregularseason %>%
## filter(playoffs == "playoffs"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.61811 -0.39617 -0.05204 0.44531 0.67401
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.312496 0.076956 4.061 8.84e-05 ***
## QuarterOvertime 0.145024 0.100721 1.440 0.153
## gametime_by_seconds 0.001350 0.001097 1.230 0.221
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4248 on 118 degrees of freedom
## Multiple R-squared: 0.0267, Adjusted R-squared: 0.01021
## F-statistic: 1.619 on 2 and 118 DF, p-value: 0.2025
For regular seasons:
##
## Call:
## lm(formula = accuracy ~ Quarter + gametime_by_seconds, data = Clutchregularseason %>%
## filter(playoffs == "regular"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.63094 -0.15510 -0.00499 0.17751 0.56380
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2964657 0.0441531 6.715 1.99e-10 ***
## QuarterOvertime 0.1324562 0.0423870 3.125 0.00205 **
## gametime_by_seconds 0.0018200 0.0005894 3.088 0.00231 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2908 on 196 degrees of freedom
## Multiple R-squared: 0.083, Adjusted R-squared: 0.07364
## F-statistic: 8.87 on 2 and 196 DF, p-value: 0.0002052
We wanted to quantify Kobe’s game influence, that is, how much he impacted the Lakers. We had to add a new dataset to record whether the Lakers won each game. We had trouble joining the datasets, because the new one was raw and not as clean as the Kaggle data, so a lot of time went into cleaning it for the join rather than exploring the data. We then created a logistic regression model using Kobe’s accuracy to predict whether the Lakers won, and found that his accuracy is highly significant in predicting the outcome.
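A toy sketch of this model is below; `accuracyByDay` and `WL` mirror the column names in our joined data, but the values are synthetic (the actual fit on the real data is shown below the sketch).

```r
# Synthetic per-game accuracy and win/loss indicator (1 = win)
set.seed(6)
n <- 400
accuracyByDay <- runif(n, 0.2, 0.7)
WL <- rbinom(n, 1, plogis(-0.9 + 3 * accuracyByDay))

# Logistic regression of game outcome on per-game accuracy
fit <- glm(WL ~ accuracyByDay, family = binomial)
coef(fit)
```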
##
## Call:
## glm(formula = WL ~ accuracyByDay, family = binomial, data = wl)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1330 -1.2744 0.8276 0.9900 1.5683
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8839 0.1869 -4.730 2.25e-06 ***
## accuracyByDay 3.0502 0.4140 7.367 1.74e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1888.8 on 1411 degrees of freedom
## Residual deviance: 1829.7 on 1410 degrees of freedom
## AIC: 1833.7
##
## Number of Fisher Scoring iterations: 4
We then created a graph of Kobe’s shots made compared to the team’s win rate. We notice that the more shots Kobe makes, the higher the Lakers’ win rate. It stood out that when Kobe made thirteen shots, the Lakers had a high win rate of 76%, and in our data, whenever Kobe made more than eighteen shots, the Lakers won.
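The win-rate-by-made-shots summary underlying that graph can be sketched with base R’s `aggregate`; the toy per-game totals below stand in for the joined Lakers game log.

```r
# Toy per-game totals: shots made by Kobe and a win indicator correlated
# with them
set.seed(7)
games <- data.frame(made = rpois(300, 8))
games$win <- rbinom(300, 1, plogis(-1 + 0.15 * games$made))

# Win rate for each number of made shots
win_rate <- aggregate(win ~ made, data = games, FUN = mean)
head(win_rate)
```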
Can we predict how many shots Kobe will make in a given game using his accuracy? - Answering this question required significantly more work, but we were able to identify which variables contributed to his accuracy: period, shot distance, and whether or not the team was in the playoffs.
What types of shots (layups, jump shots, etc.) does Kobe make best (highest accuracy)? - Kobe’s best shots are dunks and jump shots. This made sense, since Kobe won the Slam Dunk Contest in 1997. - His weakest shots are tip shots.
What are his strengths and weaknesses as a basketball player? - For this question, we had to keep in mind what Kobe’s position was. Kobe was a shooting guard, which means his greatest strength is shooting, and that is exactly what our data showed. While shooting guards tend to be weak on defense, Kobe’s weaknesses were passing and attempting very difficult shots.
What parts of the court is he the most accurate in? - According to our data, Kobe shoots best from the sidelines of the court and when he’s about 8 to 24 ft. away from the basket.
How clutch was Kobe? - This question was inconclusive. Kobe’s accuracy does not change much throughout the quarters regardless of whether it is the regular season or playoffs.
What season(s) was Kobe’s prime? - 2007 to 2010 based on number of points and accuracy.
How did Kobe improve or decline over time? - Kobe played really well and scored consistently from 2004 to 2010. We see significant drops during 2012-2014. Kobe suffered a severe Achilles injury in 2013, which could explain the drops, and as he got older he did not perform as well. He also ended up missing most of 2014 due to surgery. His athletic career came to an end in 2016, two to three years after that horrible basketball injury.
What was Kobe’s influence on a game’s win or loss? - While our Kaggle dataset helped us answer most of our questions, we needed additional information for this one, so we joined an outside win/loss dataset and did some additional research. As described above, Kobe’s accuracy turned out to be a significant predictor of whether the Lakers won.
In conclusion, we found that our data and discoveries matched the kind of player Kobe was. He was known as a great shooter and carried his team to many victories. His accuracy and success as a player can be attributed to his phenomenal shooting abilities, and we as a team had a great time exploring that.
de Vries, A., and Ripley, B.D. (2016). ggdendro: Create dendrograms and tree diagrams using 'ggplot2'.
Grolemund, G., and Wickham, H. (2011). Dates and times made easy with lubridate. Journal of Statistical Software 40, 1–25.
Grolemund, G., and Wickham, H. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (O'Reilly).
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An introduction to statistical learning: With applications in r (Springer).
Robinson, D., and Hayes, A. (2019). Broom: Convert statistical analysis objects into tidy tibbles.
Wickham, H. (2014). Advanced R (Taylor & Francis).
Wickham, H. (2019). Modelr: Modelling functions that work with the pipe.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L.D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software 4, 1686.